## [1] 1599 13
This report explores a dataset of 1599 observations containing various properties that affect red wine quality.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Display a few observations
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Display range of attribute X
## [1] 1 1599
Display the number of unique values in X
## [1] 1599
## [1] 1 2 3 4 5 6
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The X attribute appears to be an index of the observations: it runs from 1 to 1599 and has 1599 unique values. It carries no meaningful information, so it was removed.
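The index check and removal described above can be sketched as follows. This is a minimal illustration, not the report's original code: it uses only the first six observations shown earlier, and `wine_head`, `is_row_index` and `wine_clean` are hypothetical names.

```r
# First six observations, copied from the head() output above (subset of columns).
wine_head <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# A column is a pure row index if it equals 1..n in order.
is_row_index <- function(x) {
  identical(as.integer(x), seq_along(x))
}

x_is_index <- is_row_index(wine_head$X)

# Drop the index column only when the check passes.
wine_clean <- if (x_is_index) {
  wine_head[, names(wine_head) != "X", drop = FALSE]
} else {
  wine_head
}
names(wine_clean)
```

Checking before dropping guards against removing a column that only looks like an index in the first few rows.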
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid is not normally distributed, and most of its values fall between 0 and 0.75. However, its median within each quality level tends to rise as quality does, which makes me wonder how it relates to quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most red wines contain sulphates between 0.33 and 1. The median goes up with quality level, which could indicate some kind of relationship.
Alcohol level is skewed to the right. I’m curious about how it’s connected to quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
It appears that most red wines have lower alcohol content, and the median alcohol level also goes up with quality level.
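The "median goes up with quality" observation can be checked with a grouped median. The sketch below is only illustrative: it uses the ten observations visible in the structure listing earlier, not the full 1599 rows, so the numbers differ from the report's plots.

```r
# Ten observations copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol content within each quality level.
median_by_quality <- tapply(alcohol, quality, median)
median_by_quality

# Right skew tends to show up as mean > median; values from the summary above.
alcohol_mean <- 10.42
alcohol_median <- 10.20
alcohol_mean > alcohol_median
```

The same `tapply()` call applies to citric acid or sulphates by swapping the first argument.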
The quality attribute is effectively categorical, so I’d change it here to a factor.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Most red wines have a quality between 4 and 7.
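As a rough sketch of the factor conversion mentioned above, quality could be stored as an ordered factor so that its levels keep their ranking. The vector below is rebuilt from the counts in the table and is not the original data frame; the choice of an ordered factor is an assumption, since the report does not say which kind was used.

```r
# Quality counts copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality vector and store it as an ordered factor.
quality <- factor(
  rep(names(quality_counts), times = quality_counts),
  levels = names(quality_counts),
  ordered = TRUE
)

table(quality)

# Share of wines rated 4 to 7.
share_4_to_7 <- mean(quality %in% c("4", "5", "6", "7"))
round(share_4_to_7, 3)
```

On these counts, 1571 of the 1599 wines fall in the 4 to 7 range, and 1319 of them are rated 5 or 6.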
The dataset contains 13 variables and 1599 observations in total. All values are numeric and there are no missing values. However, variable X is only an index and carries no meaningful information, so it was removed from the dataset. Other findings:
1. Most red wines’ quality falls between 4 and 7; a larger value stands for better quality
2. Many variables have outliers
3. Some variables are skewed, such as free.sulfur.dioxide
4. Some variables are not normally distributed, such as citric.acid and alcohol
These should be analyzed further later.
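The outlier and skew findings can be cross-checked from the summary statistics alone. The sketch below applies Tukey's 1.5 × IQR rule, which is one common convention and not necessarily the criterion used for the report's plots; the `stats` data frame is a hypothetical helper holding numbers copied from the summary output.

```r
# Quartiles, means and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("free.sulfur.dioxide", "residual.sugar", "alcohol"),
  q1       = c(7.00, 1.90, 9.50),
  median   = c(14.00, 2.20, 10.20),
  mean     = c(15.87, 2.539, 10.42),
  q3       = c(21.00, 2.60, 11.10),
  max      = c(72.00, 15.50, 14.90)
)

# Tukey's rule: values above Q3 + 1.5 * IQR count as high outliers.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$has_high_outliers <- stats$max > stats$upper_fence

# A mean above the median hints at right skew.
stats$right_skew_hint <- stats$mean > stats$median
stats
```

For free.sulfur.dioxide the upper fence is 21 + 1.5 × 14 = 42, well below the maximum of 72, which agrees with the finding that the variable is skewed and has outliers.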
The main feature of interest is quality, along with influential features such as citric acid, sulphates and alcohol. I will investigate which features have the stronger relationships with quality.
All 11 other features are considered potential contributors in this dataset and will be examined carefully later for further insights.
No
Yes, citric.acid does not seem to have a regular distribution, and it remains almost evenly spread over the rest of its range.
Yes, I removed variable X, which is only an index of the dataset and contains no meaningful information.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
From the correlation matrix above, quality is most strongly correlated with alcohol (0.48), followed by volatile.acidity (-0.39), sulphates (0.25) and citric.acid (0.23). In addition, fixed.acidity is highly correlated with density (0.67) and citric.acid (0.67). total.sulfur.dioxide is highly correlated with free.sulfur.dioxide (0.67), which is understandable since total sulfur dioxide consists of free sulfur dioxide and bound sulfur dioxide.
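To read the quality column of the matrix more easily, the correlations can be ranked by absolute value. The sketch below uses the rounded figures printed above rather than recomputing them; `quality_cor` and `ranked` are illustrative names.

```r
# Correlations with quality, copied from the last column of the matrix above.
quality_cor <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

# Rank by absolute strength so negative relationships are not overlooked.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 4)
```

Ranking by absolute value puts volatile.acidity in second place, a relationship that is easy to miss when scanning only for positive coefficients.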
Based on the plot above, the median citric acid level of each quality level increases as quality increases.
Based on the plot above, alcohol does not have a clearly linear relationship with quality, although the two are correlated (0.48).
Based on the plot above, quality increases as volatile acidity decreases; the two have a negative relationship.
The plot above suggests that quality tends to increase as sulphates increase, consistent with their positive correlation.
The figure above indicates that total sulfur dioxide and free sulfur dioxide are strongly related in a linear relationship.
##
## Call:
## lm(formula = wine_data$total.sulfur.dioxide ~ wine_data$free.sulfur.dioxide,
## data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.120 -13.534 -7.325 7.570 197.126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.13535 1.11367 11.79 <2e-16 ***
## wine_data$free.sulfur.dioxide 2.09969 0.05858 35.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.5 on 1597 degrees of freedom
## Multiple R-squared: 0.4458, Adjusted R-squared: 0.4454
## F-statistic: 1285 on 1 and 1597 DF, p-value: < 2.2e-16
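The call above repeats `wine_data$` inside the formula even though `data = wine_data` is supplied. A more idiomatic form uses bare column names, which also lets `predict()` accept new data. The sketch below assumes nothing beyond base R and fits only the ten observations visible in the structure listing, so its coefficients differ from the full-data output.

```r
# Ten observations copied from the str() output earlier in the report.
wine_sample <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bare column names in the formula; lm() looks them up in `data`.
fit <- lm(total.sulfur.dioxide ~ free.sulfur.dioxide, data = wine_sample)

# predict() works with new data because the formula has no hard-coded wine_data$ prefix.
predict(fit, newdata = data.frame(free.sulfur.dioxide = 20))

# For a single predictor, R-squared is the squared Pearson correlation.
r_squared <- summary(fit)$r.squared
pearson_r <- cor(wine_sample$free.sulfur.dioxide, wine_sample$total.sulfur.dioxide)
all.equal(r_squared, pearson_r^2)
```

The same identity links the full-data output to the correlation matrix: the square root of the reported R-squared, 0.4458, is about 0.67, the correlation between the two sulfur dioxide variables.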
Density vs. residual sugar seems to have some outliers that affect the estimate here, so let’s remove the outliers.
The figures above show the relationship between density and residual sugar. Density vs. residual sugar seems to form a linear relationship, but a closer look reveals that outliers influenced the estimate. After removing the outliers, no strong relationship is apparent between density and residual sugar.
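The report does not state how the outliers were removed. One possible approach, shown here purely as an assumption, is to trim rows above an upper quantile of residual.sugar. The helper `trim_upper` and the 0.9 cut-off are hypothetical, and the data are the ten values visible in the structure listing.

```r
# Ten residual.sugar values copied from the str() output earlier in the report.
wine_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
)

# Keep rows at or below a chosen upper quantile of one column.
# The 0.9 cut-off is an illustrative choice, not the report's value.
trim_upper <- function(df, column, prob = 0.9) {
  cutoff <- quantile(df[[column]], probs = prob, names = FALSE)
  df[df[[column]] <= cutoff, , drop = FALSE]
}

trimmed <- trim_upper(wine_sample, "residual.sugar")
nrow(trimmed)
max(trimmed$residual.sugar)
```

Refitting the density model on the trimmed rows and comparing the slopes would show how much the extreme residual sugar values drive the estimate.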
pH is clearly related to fixed acidity in a linear relationship.
##
## Call:
## lm(formula = wine_data$pH ~ wine_data$fixed.acidity, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51780 -0.06547 0.00164 0.06488 0.52207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.814959 0.013776 276.93 <2e-16 ***
## wine_data$fixed.acidity -0.060561 0.001621 -37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1128 on 1597 degrees of freedom
## Multiple R-squared: 0.4665, Adjusted R-squared: 0.4661
## F-statistic: 1396 on 1 and 1597 DF, p-value: < 2.2e-16
From the plot above, pH is also related to citric acid in a negative linear relationship.
##
## Call:
## lm(formula = wine_data$pH ~ wine_data$citric.acid, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50025 -0.07733 -0.00570 0.08251 0.58251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427491 0.005562 616.25 <2e-16 ***
## wine_data$citric.acid -0.429477 0.016668 -25.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1298 on 1597 degrees of freedom
## Multiple R-squared: 0.2937, Adjusted R-squared: 0.2932
## F-statistic: 664 on 1 and 1597 DF, p-value: < 2.2e-16
The two models above indicate that pH has a negative linear relationship with both fixed acidity (R-squared 0.47) and citric acid (R-squared 0.29).
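To make the two fitted lines concrete, the coefficients reported above can be plugged into a small prediction helper. This is a worked illustration using the printed estimates; `models` and `predict_ph` are hypothetical names, not objects from the report.

```r
# Coefficients and R-squared values copied from the two lm() summaries above.
models <- list(
  fixed.acidity = list(intercept = 3.814959, slope = -0.060561, r2 = 0.4665),
  citric.acid   = list(intercept = 3.427491, slope = -0.429477, r2 = 0.2937)
)

# Predicted pH for a given predictor value under one of the fitted lines.
predict_ph <- function(model, x) {
  model$intercept + model$slope * x
}

# First observation in the data: fixed.acidity 7.4, citric.acid 0, observed pH 3.51.
ph_from_fixed  <- predict_ph(models$fixed.acidity, 7.4)
ph_from_citric <- predict_ph(models$citric.acid, 0)
round(c(ph_from_fixed, ph_from_citric), 3)

# Share of pH variance each single-predictor model explains.
sapply(models, function(m) m$r2)
```

For the first wine the fixed acidity line gives about 3.37 and the citric acid line about 3.43, both below the observed pH of 3.51, a miss that is of the same order as the residual standard errors of 0.11 and 0.13.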
On the other hand, free sulfur dioxide does not have a strong linear relationship with pH, as the plot above shows.
Quality is correlated with alcohol, citric acid and sulphates. However, none of these relationships is strongly linear, as the figures in the section above show. On the other hand, some relationships exist among other features, such as pH vs. citric acid, pH vs. fixed acidity, and free sulfur dioxide vs. total sulfur dioxide. Those pairs all have noticeable linear relationships.
pH has a strong negative relationship with fixed acidity and citric acid, while total sulfur dioxide seems to have a strong positive relationship with free sulfur dioxide. Further study shows that total sulfur dioxide consists of free sulfur dioxide and bound sulfur dioxide.
The strongest relationship is between pH and fixed acidity (correlation -0.68); it is a negative linear relationship.
The plot above demonstrates that pH decreases as fixed acidity increases: pH and fixed acidity clearly have a negative linear relationship, while quality does not follow it. The quality of red wine seems to be spread across the plot in a random manner.
Breaking this down further into a quality vs. fixed acidity plot, it is clear that quality and fixed acidity have only a weak linear relationship, because the medians fluctuate up and down.
From the plot above, pH tends to decrease as quality increases, a negative trend that looks more consistent than fixed acidity vs. quality, although the correlation matrix shows that it is weak (-0.06).
The plot above demonstrates that pH decreases as citric acid increases: pH and citric acid clearly have a negative linear relationship as well, while quality again does not follow it. Quality seems to be spread across the plot in a random manner.
Looking at the plot above, citric acid has a positive relationship with quality, since the medians go up as quality level increases.
The plot above demonstrates that density decreases as alcohol increases: density and alcohol have a negative linear relationship, while quality does not follow it. Quality seems to be spread across the plot in a random manner.
Breaking this down further into an alcohol vs. quality plot: at low quality levels, alcohol does not seem related to quality; however, from medium to high levels, quality and alcohol are clearly related.
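The low, medium and high quality levels mentioned above are not defined in the report. One way to make such bands explicit, shown here as an assumption, is `cut()` with hypothetical break points (3 to 4 low, 5 to 6 medium, 7 to 8 high), applied to a quality vector rebuilt from the counts tabulated earlier.

```r
# Quality counts copied from the table earlier in the report.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Hypothetical three-band grouping: 3-4 low, 5-6 medium, 7-8 high.
quality_band <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

table(quality_band)
```

With these break points only 63 wines land in the low band, so any pattern seen at the low end rests on few observations.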
In the density vs. quality plot, I cannot see a strong relationship.
The plot above demonstrates that total.sulfur.dioxide increases as free.sulfur.dioxide increases: the two clearly have a positive linear relationship, while quality does not follow it. The quality of red wine seems to be spread across the plot in a random manner.
Since free.sulfur.dioxide and total.sulfur.dioxide are highly related, and total.sulfur.dioxide contains free.sulfur.dioxide, I can use the ratio of free.sulfur.dioxide to total.sulfur.dioxide to represent the feature. The plot above indicates that there is no linear relationship between the sulfur dioxide ratio and quality. Of the 12 variables, some have linear relationships with quality, so I can try to build a linear model as follows.
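A minimal sketch of the ratio feature and the simplest model built on it is given here. The column name `free.so2.ratio` is hypothetical, and the fit uses only the ten observations visible in the structure listing, so its coefficients are not comparable to the full-data results.

```r
# Ten observations copied from the str() output earlier in the report.
wine_sample <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical column name: share of total sulfur dioxide that is free.
wine_sample$free.so2.ratio <- with(
  wine_sample,
  free.sulfur.dioxide / total.sulfur.dioxide
)
round(wine_sample$free.so2.ratio, 3)

# Simplest model: quality explained by the ratio alone.
m_ratio <- lm(quality ~ free.so2.ratio, data = wine_sample)
coef(m_ratio)
```

Because free sulfur dioxide is part of the total, the ratio always lies between 0 and 1, which makes it easier to compare across wines than the two raw concentrations.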
##
## Calls:
## m1: lm(formula = q ~ r, data = wine_data)
## m2: lm(formula = q ~ r + f, data = wine_data)
## m3: lm(formula = q ~ r + f + a, data = wine_data)
## m4: lm(formula = q ~ r + f + a + p, data = wine_data)
## m5: lm(formula = q ~ r + f + a + p + c, data = wine_data)
##
## ========================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------
## (Intercept) 5.249*** 4.622*** 1.149*** 3.298*** 3.143***
## (0.053) (0.114) (0.195) (0.595) (0.592)
## r 1.013*** 1.116*** 0.533*** 0.570*** 0.653***
## (0.128) (0.128) (0.117) (0.117) (0.117)
## f 0.071*** 0.077*** 0.041** 0.006
## (0.011) (0.010) (0.014) (0.016)
## a 0.350*** 0.363*** 0.340***
## (0.017) (0.017) (0.018)
## p -0.605*** -0.456**
## (0.158) (0.160)
## c 0.588***
## (0.126)
## ----------------------------------------------------------------------------------------
## R-squared 0.038 0.060 0.260 0.267 0.277
## adj. R-squared 0.037 0.059 0.259 0.265 0.274
## sigma 0.792 0.783 0.695 0.692 0.688
## F 62.531 51.300 186.837 144.982 121.847
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1895.927 -1876.823 -1685.861 -1678.558 -1667.714
## Deviance 1002.896 979.216 771.164 764.152 753.857
## AIC 3797.854 3761.645 3381.723 3369.117 3349.427
## BIC 3813.985 3783.154 3408.609 3401.380 3387.067
## N 1599 1599 1599 1599 1599
## ========================================================================================
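The information criteria in the table can be reproduced from the log-likelihoods, which helps when comparing the nested models. The sketch below assumes the usual definitions for a Gaussian linear model, where the parameter count is the number of slopes plus the intercept plus the residual variance; `models` is a hypothetical data frame holding the printed values.

```r
# Log-likelihoods and reported criteria copied from the model table above.
models <- data.frame(
  model      = c("m1", "m2", "m3", "m4", "m5"),
  predictors = 1:5,
  log_lik    = c(-1895.927, -1876.823, -1685.861, -1678.558, -1667.714),
  aic_table  = c(3797.854, 3761.645, 3381.723, 3369.117, 3349.427),
  bic_table  = c(3813.985, 3783.154, 3408.609, 3401.380, 3387.067)
)
n_obs <- 1599

# Estimated parameters: slopes + intercept + residual variance.
n_params <- models$predictors + 2

models$aic <- -2 * models$log_lik + 2 * n_params
models$bic <- -2 * models$log_lik + log(n_obs) * n_params

# Largest disagreement with the table (expected to be rounding only).
max(abs(models$aic - models$aic_table), abs(models$bic - models$bic_table))

# Model with the lowest AIC.
models$model[which.min(models$aic)]
```

Both criteria are lowest for m5. The largest single improvement comes between m2 and m3, where AIC falls from about 3762 to about 3382.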
free.sulfur.dioxide and total.sulfur.dioxide (combined as a ratio), alcohol and citric.acid are linearly related to quality in the model. fixed.acidity and pH are other variables that relate to quality, though more weakly; notably, pH has a negative linear relationship with quality.
It is interesting and surprising that quality also relates to pH in a negative linear relationship. Alcohol level also seems to play a part in the quality of red wine: it is surprising that the more alcohol, the better the quality becomes, but I’d assume the influence of alcohol will max out at some point.
Yes, I did. The model starts with the sulfur dioxide ratio as the only predictor of quality. By adding further variables, including fixed acidity, citric acid and pH, the model improves moderately (R-squared rises from 0.038 for m1 to 0.277 for m5).
Alcohol level plays an essential role in quality. Starting from medium quality, the influence of alcohol becomes clearer than at the lower end of quality.
pH clearly decreases as fixed acidity becomes larger. In addition, pH decreases as quality increases.
Citric acid has a clear positive relationship with quality in this case: as the median citric acid goes up, quality improves too. It is surprising that quality is related to citric acid in red wine.
The redwineQuality dataset contains 1599 observations of 13 variables, including the index column X, with the quality variable as the class. I started by looking at each individual variable and exploring some interesting features with histograms. Then I looked at the relationships between pairs of variables to understand where they lie. Finally, I plotted multivariate diagrams to get a better understanding of the relationships among the chosen variables.
There are no very strong relationships with quality, and I had difficulty understanding them at the beginning, but moderate relationships between some variables, such as total.sulfur.dioxide vs. free.sulfur.dioxide, pH vs. quality and alcohol vs. quality, were later found from the plots. It is surprising that citric.acid also plays an essential role in making high-quality red wine. With the variables most related to quality I created a linear model. It is, however, limited by the amount of data, and it could be improved if there were more observations in the dataset.